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IMPROVED MULTIVARIATE NORMAL MEAN 
ESTIMATION WITH UNKNOWN COVARIANCE WHEN p IS 

GREATER THAN n 

By Didier Chetelat and Martin T. Wells^ 

Cornell University 

We consider the problem of estimating the mean vector of a p- 
variate normal (6, E) distribution under invariant quadratic loss, {S — 
S)'E~^((5 — 6), when the covariance is unknown. We propose a new 
class of estimators that dominate the usual estimator 5'^{X) = X. The 
proposed estimators of 9 depend upon X and an independent Wishart 
matrix S with n degrees of freedom, however, S is singular almost 
surely when p> n. The proof of domination involves the development 
of some new unbiased estimators of risk for the p> n setting. We also 
find some relationships between the amount of domination and the 
magnitudes of n and p. 

1. Introduction. Suppose a p-dimensional random vector X is observed 
which is normally distributed, with mean vector and unknown positive 
definite covariance matrix S, and we wish to estimate 9 under the invariant 
quadratic loss 

(1.1) L{e,6) = i6-ey^^\6-e). 

Since the covariance matrix S is unknown, a random matrix S is observed 
along with X , which is assumed to be independent of X, and has a Wishart 
distribution with n degrees of freedom, where p > n. In high-dimensional 
estimation problems, where p, the number of features, is nearly as large as 
or larger than n, the number of observations, the ordinary least squares 
estimator does not typically provide a satisfactory estimate of 0. 

Modern data sets are increasingly becoming characterized by a number 
of features that are much larger than the number of sample units (large-p. 
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small-n) in contrast to classical data sets where the number of sample units 
is often much larger than the number of random variables (small-p, large-n). 
Modern applications in the p> n setting include examples from microarrays, 
association mapping, proteomics, radiology, biomedical imaging, signal pro- 
cessing, climate modeling and finance. For instance, in the case of microarray 
data, the dimensionality is frequently in the thousands or beyond, while the 
sample size is typically in the order of tens. The large-p, small-n scenario 
poses challenges in most inferential settings. We are considering a canoni- 
cal setting. For the usual multivariate location-scale estimation problem let 
W = {Wi, . . . , Wp) denote an x p matrix of data {N is the number of 
observations and p the number of features), where Wi are taken from a p- 
dimensional normal distribution with mean vector 9 and covariance matrix 
H. In this article we let the X and S be the sample mean and covariance 
of the features, respectively. In the context of this notation, S = A^"^H and 
n = N-l. 

The usual estimator under invariant quadratic loss is do{^) = It is min- 
imax and admissible when p<2 and p<n. However, when p > 3 and p<n, 
So{X) remains minimax but is no longer admissible. Explicit improvements 
are known in the multivariate normal case [Berger and Bock (1976), Berger 
et al. (1977), Berger and Haff (1983), Gleser (1979, 1986), James and Stein 
(1961)] and in the case of elliptically symmetric distribution [Srivastava and 
Bilodeau (1989), Fourdrinier, Strawderman and Wells (2003)]. 

In this article we primarily concentrate on the case p> n and construct a 
class of estimators, depending on the sufficient statistics {X,S), of the form 

(1.2) 6{X,S)=X + giX,S), 

which dominate 6q{X) under invariant quadratic loss. Note that, although 
the loss in (1.1) is invariant, the estimate in (1.2) may not be [except for 
6o{X)]. This class generalizes several estimators studied previously for the 
multivariate normal distribution to the p < n setting [Berger and Bock 
(1976), Berger et al. (1977), Berger and Haff (1983), Gleser (1979, 1986), 
James and Stein (1961)]. Examples of estimators we study here in this set- 
ting extend the class of so-called Baranchik estimators and includes a new 
high-dimensional James-Stein estimator 

where < a < and 5^ is the Moore-Penrose inverse of S. 

The estimation of the inverse covariance matrix, namely, the precision 
matrix of a multivariate normal distribution has been an important 

problem in practical situations as well as from a theoretical perspective. But, 
when p> n, the Wishart-distributed sample covariance matrix is singular; in 
this case, one is tempted to construct estimators using the Moore-Penrose 
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generalized inverse . Recently there has been an increased interest in the 
problem of estimating the covariance matrix of large dimension given vari- 
ables of dimension larger than the number of observations [Bickel and Levina 
(2008), d'Aspremont, Banerjee and El Ghaoui (2008), Konno (2009), Ledoit 
and Wolf (2004), Levina, Rothman and Zhu (2008), Rothman et al. (2008)]. 
Our method of proof relies on an unbiased estimator of risk difference, say, 

p{X, S). Specifically, we show that, for g{X, S) of the form — ^^^x'S+^x^ 
the estimator S{X, S) = X + g{X, S) dominates X provided p{X, S) < 0. In 
the next section we present the main results and their proofs are given in 
Section 3. We need Stein's integration-by-parts identity [Stein (1981)] and 
the so-called Stein~Haff identity for the singular Wishart distribution. The 
Stein-Haff identity was derived by Half (1979) and Stein (1977) for the full 
rank Wishart distribution. A similar identity for the elliptically contoured 
model has been given by Fourdrinier, Strawderman and Wells (2003). We 
make some concluding comments in Section 4. 

For a matrix M, let M' denote its transpose, M'^ its Moore-Penrose 
pseudo-inverse and its componentwise derivative matrix, that is, the ma- 
trix such that = Moreover, let 5ij denote the Kronecker delta. 

2. Main results. Let X be a random vector distributed as Np{6,Ti) 
with unknown 6 and S. Suppose an estimator of S is available, say, S ~ 
Wishartp(n, S), with S independent of X. By definition of the Wishart dis- 
tribution, we can write S = Y'Y for some matrix normal Y ~ Nnxp{0,I ^Y,). 
An elementary property of this distribution is that S is (almost surely) in- 
vertible if p <n, and (almost surely) singular if p > 77, [cf. Srivastava and 
Khatri (1979)]. 

An usual estimator of 9 is 6'^{X,S) = X; however, it turns out that 
this estimator is inadmissible under quadratic loss. If some estimator 5 ~ 
Wishartp(r7, S) is available, with n>p> 3, 6^ is dominated by the so-called 
James-Stein estimator 



6'^{X,S) = j^l 

The main contribution of this article is to extend this type of result to a 
more general class of estimators in the p> n setting. 

For some positive, bounded and differentiable function r : M ^ M, define 
the Baranchik-type estimator 

f r{X'S+X)SS+ \ 

(2.1) 

= X + g{X,S), 

where / is the identity matrix and 5'"'" denotes the Moore-Penrose inverse 
of 5. This estimator generalizes the usual Baranchik (1970) estimator to the 
unknown covariance setting for p> n. 
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Theorem 1. Let min(p,n) > 3. Suppose that: 

(i) r satisfies < r < 

(ii) r is nondecreasing; and 

(iii) r' is bounded. 

Then under invariant quadratic loss, 6r dominates 6^ . 

Throughout the article we will use the expression tr(5'S'"'"), which of course 
equals min(?i,p). This notation allows us to simultaneously handle both 
the p> n and n>p cases. The condition min(p,?i) > 3 merely guarantees 
that condition (i) of Theorem 1 holds for some r and is reminiscent of the 
dimension cutoff in classical Stein estimation. 

Proof of Theorem 1. The hypotheses of the theorem imply that 
r is differentiable almost everywhere. Under invariant quadratic loss, the 
difference in risk between Sr and 6^ is given by 

Ae = Ee [{X + g{X, S) - 9)'T.-\X + g{X, S) - 9)] 
(2.2) - Ee[{X - e)'^~\X - 6)] 

= 2Ee\g{X, S)'T.-\X - 9)] + Ee[g{X, S)'^-'g{X, S)] . 

In order to show the domination result, we need to show that under the 
sufficient conditions on r, (2.2) is nonpositive for all 9. First, for the leftmost 
term of (2.2) it can be shown that 

2Ee[g{X, SY^^Hx - 9)] = 2Ee[dwxg{X, S)]. 

Fourdrinier, Strawderman and Wells (2003) give a more general form of this 
result in their Lemma l(i); it is essentially an extension of Stein's classical 
integration by parts identity. By using Lemma 2 in Section 3, we have that 

r{X'S+X)SS+X' 



2Ee[dwxg{X,S)] = -2Ee 



(2.3) 



div 



X- 



X'S+X 



-2En 



2r'{X'S+X) + r{X'S+X) 



tr(55+) 



X'S+X 

For the right term of (2.2), we find, through Lemma 3 in Section 3, 
Ee[g{X,S)'^-^g{X,S)] 

S+XX'S+S' 



Eg 



Ea 



ntr r\X'S+X) 



+ ti[Y'VY\r'^ {X'S+X) 



S+XX'S+S 
(X'S+X)^ 

SS+XX'S+ 



{X'S+Xf 
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The finiteness of the risk of 5r is guaranteed to hold by Theorem 2 in Sec- 
tion 3 for ah p and n. 

Now applying Lemma 1 in Section 3, we find 



Ee\ni.[r\X'S-X) j 



+ tr(y'Vy<^ r'^{X'S+X 



SS+XX'S 

' {x's+xY 



(2.4) 



Ea 



n- 



^{X'S+X) 
X'S+X 



Ar{X' S+ Xy {X' X) 



+ r\X'S+X) 



D-2tr(55+) +3 



X'S+X 



En 



\X'S+X) 



n + p-2tr(55+) + 3 



X'S+X 

Replacing (2.3) and (2.4) back into (2.2), we obtain 



Ar{X'S^X)r' {X'S+X) 



Af> = Ef, 



r\X'S+X) 



n + p-2tv{SS+) + 3 
X'S+X 

tr(55+) -2 



Mx's+x) 

- 4r'{X'S+X){l + r{X'S+X)} 



Since r is nonnegative and nondecreasing, it follows that —4:r'{X'S+X){l + 
r{X'S+X)} < 0. Finally, for the X and S such that r{X'S+X) / 0, 



^ r{X'S+X) < 



X'S+X 

2{tr{SS+)-2) 



X'S+X 
2(min(n,p) — 2) 



n + p — 2tr(S'5'+) + 3 n — 2min(n,p) + 3' 

Therefore, under the three sufficient conditions on r, it follows that < 
for any 6, that is, the domination result holds. □ 

In the p> n setting, we obtain the following two corollaries. 

Corollary 1. Forp > n > 3, 5r dominates 6^ under invariant quadratic 
loss for all r nondecreasing, differentiable and satisfying 

2(n-2) 



(2.5) 



< r < 



p — n + 3 
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Corollary 2 (James-Stein estimator with large p and small n). For 
p> n>3 and a € M, the JamesStein-like estimator 

(2.6) Sfi^^S)=(l-^^)x 



X'S+X^ 

dominates 5^ under invariant quadratic loss for all 

0<o<^'"-^>. 

p — n + 3 

Note that if p is only moderately larger than n, Corollary 1 implies that 
one can construct an estimator with substantial improvement over 5^. How- 
ever, in the ultra-high-dimensional setting the denominator in (2.5) could 
be quite large and, consequently, the amount of improvement over 6^ could 
be quite small. The estimator in (2.6) generalizes the classical James-Stein 
with unknown covariance matrix, 

which is, of course, restricted to the case p<n, for a € M+. In this setting, 
this result is consistent with previous bounds in Fourdrinier, Strawderman 
and Wells (2003) (where n — 1 is used instead of our n). 

3. Technical results and proofs. It remains to clarify several of the some- 
what technical computations used in the proof of Theorem 1. We provide 
them in this section; these computations are likely to be of independent in- 
terest and showcase several technical maneuvers that the reader could find 
useful in dealing with singular Wishart matrices. 

Proposition 1. Let Y be an n x p matrix, S = Y'Y , X a p vector and 
F = X'S+X. It then follows that 

( dS 1 

I OyaP J kl 



(iii) 



dF 
dY^p 
d{S+XX'SS+}ki 



(ii) --^ = -2{X'S^Y')^{S^X)p + 2{X'S^S^Y'U{I - 55+)X)^; 



dY^p 

{S+S+Y\^{{I-SS+)XX'SS+)pi 

- S+^{YS^XX'SS^)^, - {S+Y'),^{S+XX'SS+)pi 

+ {I-SS+),^{YS+S+XX'SS+)^i 
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+ {S+XX'),p{YS+)^, + (5+xx'y'),js+)^, 
+ {S+XX'S^Y')^^{I-SS+)^, 

- {S+XX'SS+),^{YS+)^i - {S+XX'SS+Y'),^{S+)^i. 
Proof. First, notice that from the usual chain-rule that 



This shows (i). 

Let A be a symmetric matrix and t G M, then 

This result was, it seems, first proved in Golub and Pereyra (1973), as their 
Theorem 4.3, but can be found in standard textbooks on elementary linear 
algebra. Also, again for A symmetric, we have AA& = A'^A and A{I — 
AA+) = {I-AA+)A = A+{I-AA+) = {I-AA+)A+ = 0. This easily follows 
from elementary properties of the Moore-Penrose pseudoinverse. 

Since S = Y'Y, notice through a singular value decomposition argument 
that SS+Y' = Y' and, thus, (/ - SS+)Y' = 0. Using (i), we find that 



dYai3 dYap 

= - Y,iX'S^)j,{6pkYai + 5piYak}iS^X)i 



k,l 



+ J2iX'S+S+),{5fSkY^i + 5^iY^k}iiI - SS+)X), 

k,l 

+ ^{X'il - SS+)),{6pkYai + 5piY^k}iS+S+X)i 

k,l 

- Y,{X'S+)pY^i{S+X\ - Y,{X'S+\Y^k{S+X)p 

I k 

+ J^(x'5+5+)^y«KU - ss+)x\ 

+ J2iX'S^S+),Y^k{{I - SS-^)X)p 
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+ Y,{X'{I-SS+))pY^i{S+S+X)^ 
I 

+ Y.{X'{I - SS+)),Yak{S+S+X)f, 

k 

= -2{X'S+Y')^{S+X)f, + 2{X'S+S+Y')^{{I - 55+)X)^ 
which gives (ii). 

Using (i), we have that for any conformable matrices A and B 



A^^B] =^Akii^^)' B 



dYa^ Jki Xj '^^^"/^^u 



= "^^AkiidfiiYaj + 5pjYai}Bji 

= ^ AkjsYajBji + ^ AkiYoiiBpi 

3 i 

= Akp{YBU + {AY'),^B^i. 

Therefore, using again (/ — SS^)Y' = 0, 

d{S+XX'SS+}M 
dY^fi 

^1 00+ 



<^ 5+5+ (/ - SS^)XX'SS^ 

I oYo,^ 



- S+ — — 5+XX'55+ + (/ - SS+) — — S+S+XX'SS+ 

dYal3 dYal3 

+ S+XX' T^5+ + 5+XX'55+5+ - 55+) 

dYa(3 dY^/s 

- 5+XX'55+ —^5+ + S+XX'sn - 55+) ——5+5+ 

dYaf} dYal3 

= {S+S+Y'),^{{I - SS+)XX'SS+)^i 

- S+^{YS+XX'SS+)^i - {S+Y'),^{S+XX'SS^)^i 
+ {I-SS+),^{YS+S+XX'SS+)^i 

+ (5+XX0,^(y5+),, + {S+XX'Y'),^{S^)^, 

+ (5+XX'5+y')fcaU-^5+)^, 

- (5+XX'55+),^(y5+)„, - (5+XX'55+yOfca(^+)/3z, 
which gives (iii). □ 



IMPROVED NORMAL MEAN ESTIMATION FOR P > N 
Lemma 1. Under the hypotheses of Theorem 1, we have 

= -4r(X'5+X)r'(X'5+X) + r'^{X'S+X) 



' Q+ v\J ( v> Q+ v\ 1 ^'ifvic+x\ ^ 2tr(S5"'")+3 

X'S+X 

)ij = dY~, 



where Vy is interpreted as the matrix with components (Vy)jj = ■qy~- 



Proof. To simplify computations, in what will follows, we let F = 
X'S+X. We then have 



a,/3 

(3.1) =2^(y'),„r(F)r'(F) 



dF {SS+XX'S+)pj 



(3.2) +53(y').„r^(F)<5«i™:^ 



i72 

o,/3 



(3.3) + Y.^Y\^'-^{F) {dF/dY^p){SS+XX'S+)p, ^ 



a, 13 



To simplify (3.1) and (3.3), we apply Proposition l(ii) to get 
= -2Y,{Y%,{X'S+Y'US+X)^{SS+XX'S+)^^ 

a, (5 

+ 2Y,iX'S+S+Y')jYU{S+XX'SS+)^^{iI - SS+)X)^ 

= -2X'S+X{SS+XX'S+)ij. 
Using this, we get for (3.1) 

(3.4) 

= -4r(F)r'(F)- 



j,^,{SS+XX'S+)ij 



F 
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and (3.3) becomes 

3.5 

This leaves the term (3.2) to analyze. Using Proposition l(iii) 



dYc 



^ d{S+XX'SS+},p 

^{{S+S+Y')^^YU{I - SS+)XX'SS+)p^ 

-S%{Y'\^{YS+XX'SS+)^p 

— {S'^Y') ■^Yai{S^ XX' SS'^) 

+ SS-^)^p{Y\^{YS-^S-^XX'SS^)^^ 

+ {S+XX')^p{Y\^{YS-^)^p 

+ (5+xx'y'),„y„.(5+)^^ 

+ {S+XX'S+Y')^^Y^,{I - SS+)pp 
-{S+XX'SS+)^p{Y'\^{YS+)^p 

-{S+XX'SS+Y')^^Y^,{S+)pp} 
{S+XX'SS~^{I - SS+))ij 

- {SS+XX'S+)ij - tr{S+XX'SS+){SS+)-- 
+ tr((/ - SS+)XX'SS+){S+)ij 

+ {SS+XX'S+)^j + tv{S+){SXX'S+)i^ 
+ tr(/ - SS+){SS+XX'S+)ij 

- {SS+XX'S+)^j - tr{S+){SXX'S+)^. 

{p - tr(55+) - 1){SS+XX'S+}^. - {X' S+ X){SS+}... 
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Next, applying this computation in (3.2), we obtain 

(3.6) 

Now we can combine (3.4), (3.6) and (3.5) together to complete the proof. 
That is, we have 



tr(^yVy|r2(F) |j 

= 5;|-4r(F)r'(F)i^^^l^^ 

i 

+ 4r2(F)- 



2 ^^{SS+XX'S+)i 



F2 

+ (p - tr(55+) - l)r\F)^^^^^^^^ - r\F)^^^^^' 



i?2 ^ ' F 

. 2/^^P-2tr(S5+) + 3 

= -4r(F)r'(F) + r^{F)- ^— ^ 

F 

as desired. □ 

Lemma 2. Under the hypotheses of Theorem 1 we have 

r{X'S+X)SS+X ^ I ,,^.o+^^^ t^(gg+)-2 

^^^^ -^nXS X)+r{XS X) . 

Proof. Again, to simplify computations, let us denote X' S~^X by F. 
We find 

r ,^,SS+X 
divxjr(F)— ^ 



V {SS+X), {d/dX,){{SS+X),} 

^ ^ ' dXi F ^ ' F 

i 

_^^p^{dF/dX,){SS+X\ 



F2 

1 {SS+X)i 



12 



D. CHETELAT AND M. T. WELLS 



r{F) 



F 

{{d/dx,)Y.^^iXkXiS+}{ss+x), 

{SS+X\ 



-Y,r'{F){{X'S+\ + {X'S^\} 



F 



+ r{F)^l^ - ^^p^ {{X'S+\ + {X'S+).i\-{SS+X\ 
F F^ 

= 2r'{F) + r{F) ^ ^' 

as desired. □ 

The following result is an extension of a result in Konno (2009). This 
type of result was first obtained by Kubokawa and Srivastava (2008) and 
then was extended by Konno (2009). In our generalization we make use of a 
divergence version of Stein's lemma that comes with somewhat weaker mo- 
ment conditions, rather than the element-by-element assumptions in Konno 
(2009). These weaker moment conditions allow us to cover the p equals n 
and n + \ cases. 

Lemma 3. Let Y ~ Nnxp{^,In ® S), let S = Y'Y which has, by defini- 
tion, a Wishartp(n, S) distribution, and let G{S) be apxp random matrix 
that depends on S. Let Vy be interpreted as the matrix with components 
(^Y)ij = aF~' ^'^'^ ^ symmetric positive definite square root of S, 
define Y = Y and H = AGA'^ . Then 

^[tr(S-i5G)] = S[ntr(G) + tr(y' VyG')] 
under the conditions 

(3.7) i?[|div^^^(^).vec(yi7)|]<oo, 
where vec(M) denotes the vectorization of a matrix M . 

Proof. Define S = Y'Y = A~^SA~^. Notice that, by construction, Y ~ 
NnxpiOi In <X) /p) — this means, by definition of the matrix normal distribu- 
tion, that vec(y) ~ Nnp{0, Inp)- We can write 



E[tr{SH)] = E 



= E[vec{Y) -veciYH)]. 

Using the divergence form of Stein's lemma, which can be found in Lemma A.l 
in Fourdrinier and Strawderman (2003), we obtain, under the moment con- 
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ditions outlined in (3.7), 

E[vec(y) • veciYH)] = ^[div^,,(y) vec{YH)] 



E 



E 



E 



d 



n 



This last expression can be expressed in a compact matrix form as 

E[iv{SH)]=E[niT{H)+iT{{Y'\Iy)'H)]. 
Finally, we notice 

E[tY{H)] = E[tY{AGA-% 
E[tT{SH)] = E[ti{A~^SGA~% 
E[tr{{Y'VyyH)] = EMAiY'VyyCA-^)], 
which concludes the proof. □ 

Theorem 2. Let Y ~ A^nxp(0, In ^ o,nd for A the symmetric positive 
definite square root ofT,, let 1" = 1^74""'^. Let r be any bounded differentiable 
nonnegative function r :M — >■ [0, Ci] with bounded derivative \r'\ < C2. Define 

and H = AGA^^ . Then for all p and n 

(3.8) i?[|div^^^(^)Vec(yF)|]<oo. 

Proof. We first compute div^^^^y^ vec(yil). As always, to ease nota- 
tion, we shall write F = X' S~^X. We have 

div^ec(y)vec(y/?) 



d 



n 



a, 3 



dY^ 
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(3.9) X { 2r{F)r'{F) 



dF {AS+XX'SS+A-^}ji 



V- . d{S+XX'SS+}ki , 1 
(3-10) + E ^^-^ QY, 

(3.11) - r\F){AS+XX'SS+A-^}.^ ^^^'^f^''^ 

We simplify each part of the expression. For (3.9), using Proposition l(ii), 
we find 

a,l3,i,j 

r{Fy{F) 



p2 

X {-{X'S^Y')^Y^^{AS^XX'SS^A~^}.^Aifi{S^X)^ 



(3.12) + {X' S^Y')J'^j{AS+ XX' SS^ A-^}^^Aip{{I - SS+)X)p} 

= -A^-^^^K!^(X' S+Y'Y A-^ AS+ XX' SS+ A~^ AS+ X) 
F^ 

+ 4 ^(-^>'^(-^) (^x'S^S^Y'YA-^AS^XX'SS^A~^A{I - SS+)X) 
= -Ar{F)r'{F). 
Similarly, for (3.11) 



E Y^^A,,r\F){AS^XX'SS^A-^}^^^-^^^ 



F^' 



= 4^^ E (^'^+^')a^^aj{^5+XX'55+A-l}.,^i^(5+X)^ 

(3.13) 

= 4-^(x'5+y'yA~^A5+xx'ss+A~^A5+x) 
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This leaves us with (3.10). Using Proposition l(iii), we obtain 

a,f},i,j k,l ^ 

_rHn X- YA,A;A-' 

a,l3,i,j,k,l 

X {(5+5+y), J(/ - SS+)XX'SS+)^i 

- S^i^{YS'^XX'SS'^)^i 

- {S+Y),^{S+XX'SS-^)^i 

+ {I-SS^\^{YS+S+XX'SS+)^i 

+ {S^XX\p{YS+)^, 

+ {S+XX'Y\^{S+)^i 

+ {S^XX'S+Y\M-SS+)^i 

- (5+XX'55+),^(y5+)„, 

- {S+XX'SS+Y%^{S+)^i} 

= ^ E {AMS^S+Y),^Y^,A,p{iI-SS+)xrsS+)^^Ai:^ 

a,l3,i,j,k,l 

- Y;^iYS+XX'SS+),,A~'A^S;,A,, 
-A,,iS+Y),^Y^jA^{S+XX'SS+)f,,A-' 
(3.14) + Y^^{YS+S+XX'SS+)^iA-^Ap{I - SS+)^,Akj 

+ Y^aiYS+)^iA~^A[s{XX'S+)^,Ak, 
+ A,k{S+XX'Y'),^Y^,Ap{S+)^iA^' 
+ A,k{S-^XX'S^Y\^Y^jAi^{I - SS^)^iA^' 

-y;^{ys+)^,a-'Ap{ss+xx's+)^,a,, 

-A,,{S^XX'SS+Y'),X,A^iS+)^,A-'} 

= -^-r^{tr(AS+S+Y'YA-^) ■ tv(A(I - SS+)XX' SS+ A'^) 
- tT{A-^Y'YS+XX'SS+A-^AS+A) 
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- tY{AS^Y'Y A~^) tY{AS^XX'SS^A~^) 

+ iv{A"^Y'YS+S+XX'SS+A~^A{I - SS+)A) 

+ tx{A~^Y'YS^A~^AXX'S+A) 

+ tx{AS^XX'Y'YA~^) ■ tx{AS^A-^) 

+ iv{AS^XX'S+Y'YA-^) tr(^(/ - 55+)^"^) 

- ix{A-^Y'YS^A-^ASS^XX'S^A) 

- iY{AS+XX'SS+Y'YA~^) tr(^5+^-^)} 

• {-X'S+X - tY{SS+) ■ X'S+X 

+ X'S^X + X'SS+X ■ tr(5+) + X' X ■ {p - tr(55+)) 

- X'S^X - X'SS^X ■ tr(5+)} 



F 



(p-tr(55+) - 1). 



Having re-expressed div^^^^y^ vec(yi?), we now need to bound it above. By 
virtue of (3.12), (3.13) and (3.14), we have 

^[|div.ec(y)vec(yi?)|] 



■ E 



(3.15) 



niT{H) +4- 



F 



{p - tT{SS^ 



r'^{F) 



F 



<Cf\3 + p-tr{SS+) + n\E 



4r(F)r'(F) 



+ 4C1C2. 



It only remains to show that E[j;] is finite. By definition of the Wishart 
matrix distribution, we can define a T ~ Wishartp(n, /„) such that S = ATA. 
Let T = H'DH be the spectral decomposition of T, with D = diag(Ai). 
Write the eigenvalues of r+ as A^, so that = diag(A^), and let X^^^ be 
the smallest nonzero eigenvalue of . The following two identities follow 
from Tian and Cheng (2004) [Theorem 1.1, equations (1.2) and (1.4)] and 
symmetry of T: 

{ATA)+ = {T+TA)+T+{AT+T)+, 

{T+TA)^{T+T) = (r+rA)+. 

Using these identities, we have 

X'S-^X = X'{ATA)+X = X'{T+TA)'^T+{AT+T)^X 
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y^{X'{T+TA)+H'}lXi 



k 



>X1 



X'{T+TA)^H'H{AT+T)^X 

X'{T+TA)^{T'^T){AT+T)^X 

= ^min ■ X'{T+TA)+{AT+T)+X. 

Applying Cauchy-Schwarz provides us with the bound 

X'{T+TA)^{T+TA)X < X' (T+TA)^ (AT+T)'^ XX' {AT+T){T+TA)X 

so that we then have 

111 1 



< 



F X'S+X - A 



< 



X'[T+TAy{AT+TyX 
1 X'AT+TAX 



\+. X'{T+TA)+{T+TA)X' 



To ease notation, let us write Q = AT+TA and R = (r+ryl)+(r+rA). Col- 
lecting the results together, we bound (3.15) by 



(3.16) 



<Cl\?, + p-2ii{SS+) + n\E 



1 X'QX 
A+. X'RX 



AC1C2. 



We now use some independence results. We can write the singular value 
decomposition of T as T = H'DH, but we can also write it as T = H[DiHi, 
where Hi is semi-orthogonal {HiH[ = I) and Di is the matrix of the positive 
eigenvalues of T. If T has full rank (i.e., n >p), then this coincides with the 
singular value decomposition of T. In the full rank case, Srivastava and 
Khatri (1979) [Section 3.4, equation (3.4.3)] provide the joint density of H 
and D = diag(di) in the standard Wishart case (which applies to T) as 



(3.17) 



fHMH,D) 



:C7(p,n)jZ)|("-P-i)/2 



etr 



-D 



■i<j 



9p{H) 



for constants C{p,n) and functions Qp. Therefore, H and D are independent. 
In the rank-deficient case {p> n), Srivastava (2003) (Section 3) provides an 
equivalent expression which, in the singular Wishart case, gives 



(3.18) 



fHi,Di{Hi,Di) 



K(p,n)|Z)i|(P-"-i)/2 



etr 



2 



i<j 



gn,p{Hl) 
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for constants K{p,n) and functions gn,p, so, again, we find Hi and Di inde- 
pendent by factorization. Now, A^j^^ is a function, in the full rank case (resp., 
rank-deficient case), of only (resp., D^^), and we can write T+T = H'H 
(resp., T^T = H[Hi), so A^j^^ and T+T are independent. Being functions 
of S, they are also both independent of X . Now, the nonzero eigenvalues of 
T"*" are the inverses of the nonzero eigenvalues of T, a general fact about 
Moore-Penrose pseudo-inverses. Therefore, denoting the largest eigenvalue 
of T as Amax, we can split up the expectations in (3.16) and get the bound 



Now, it follows from positive semi-definiteness of T that -E[Amax] ^ -^[ti'(^)]- 
If n > p, tr(r) ~ Xpn [cf- Muirhead (1982), Theorem 3.2.20] and so E[tv{T)] = 
pn < oo. If p > n, recall we can write T = Z'Z for Z -/V„xp(0, In Ip) by 
definition of the Wishart distribution; and ZZ' ~ Wishartn(p, /n) so that 
tr(r) = iv{ZZ') ~ XpnS so, again, ii^[tr(r)] = pn < oo. Therefore, in either 
case, £'[Amax] < < oo. 

We still have to check that the expectation involving X, Q and R in 
(3.19) is finite. Let r = rk(i2) = rk((5) = rk(S') and write the spectral decom- 
position of {T~^TA) as UAU', with A = diag(L, 0(p„r)) where L is the vec- 
tor of the r nonzero eigenvalues of {T~^TA). Then R = {T^TA)^{T'^TA) = 
[/diag(/,., 0(p_j,))L'"'; let us define thep x {p — r) matrix E = C/[0(p_r)xr-^(p-r)]') 
that is, so that RE = and E has full column rank p — r. Notice that QE = 

AT+TAU\G^p_r)^rI(p~r)\' = AU MJ'U\Q^p_r)^rI{p-T)\' = 0- Since Q and R 
are symmetric positive semidefinite, we can use results in Magnus (1990) 
[Theorem l(i) with A = Q and B = R] to conclude that 



This concludes the proof of the theorem. □ 

4. Numerical study. This section provides some numerical results to 
showcase the improvement in risk of the minimax estimator over the usual 
estimator. More precisely, we compared the James-Stein estimator in (2.6) 
given by 



and the usual estimator 6^ = X under invariant loss. (In addition, we consid- 
ered the positive James-Stein estimator to be discussed in Section 5.) The 
empirical approximations of the invariant risk of these estimators were plot- 



(3.19) <Cf\3+p- 2tr(55+) + n\E[\^^^]E 



'X'QX 
X'RX 



+ AC1C2. 
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ted for p = 10, 20, 50 and n 
were considered: 



2 n 

2 



1. Three covariance matrix structures 



Spiked: A diagonal matrix with the first p/2 diagonal elements equal to 1, 
and the last p/2 equal to 10. 

Autoregressive: Autoregressive covariance matrices of the form 



P 



P 
1 



P 
P 
1 



V 



\ 



for p = 0.5. 

Block diagonal: Block diagonal matrices with p/2 blocks of the form 
for p = 0.5. 



il 1) 



In all cases, the true mean was chosen as 9 oc (1,...,1). 

We remind the reader that the risk of the trivial estimator is always p, 
regardless of or S. With this in mind, we see from Figure 1 that in all 
six scenarios the pattern of domination of the new estimator is similar to 
one of the usual James-Stein estimators. Also note that, as predicted by the 
theoretical results, the domination decreases as the smaller n tends to p. 

5. Comments. An interesting property of the Moore-Penrose inverse is 
that for any A, AA& is the matrix that projects onto the subspace spanned 
by A (its column space). It follows that the proposed generalized Baranchik 
estimator can be expressed as 



(5.1) 



P.xX+ 1 



X'S^X 
riX'S^X) 



SS^X 



X'S+X 



PsX, 



where Ps = SS'^ and Pg± = I — SS'^ are the projection matrices onto the 
column space of S and its orthogonal complement, respectively. In terms 
of the kernel and image of the symmetric matrix S, Ker(P5x) =Im(5) and 
lm{Pgj_) = Ker(S'"*"). When p> n, this means we can interpret our estimator 
as applying shrinkage only on the component of X in the subspace spanned 
by our covariance matrix estimator S. In particular, note that the estimator 

PsSr{X, S) = {1— ^^x'S+x^ )PsX dominates PsX under invariant loss func- 
tion (1.1), since R{PsSr,e) - R{PsX, 6) = R{5r,e) - R{X, 6") > if r satisfies 
the conditions of Theorem 1. This suggests there might be an easier, more 
abstract proof of Theorem 1, one not relying on brute computations but 
on the already known full rank S case, although we have not been able to 
obtain such a result. 
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Spiked covariance, James-Stein Spilced covariance, positive James-Stein 



5 10 15 20 25 30 

iieii^ 

Autoregressive covariance, James-Stein 



? - 



15 

llell^ 



20 25 30 



Biocl< covariance, James-Stein 



10 15 20 25 30 



5 10 15 20 25 30 

iieii'^ 

Autoregressive covariance, positive James-Stein 



15 

llell^ 



20 25 30 



Biock covariance, positive James-Stein 



10 15 20 25 30 



Fig. 1. The risk function plots of and Si^'^ for a = (n — 2)/(p — n + 3) are in the 
left and right columns, respectively. The lines, from thinnest to thickest, are forp= 10,20 
and 50. The solid and dashed lines are, respectively, for n = p/2 and n=p — 1. 
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A natural extension of the James-Stein estimator, 6^^ in (2.6), is a positive- 
part-type James-Stein estimator. The form of the estimator in (5.1) suggests 



where 6+ = max(6, 0). Simulation evidence from Figure 1 suggests that for 
a = {n — 2)/{p — n + 3), 6^^^ dominates 6^^ under invariant loss. 

One of the interesting differences between the n> p and p> n cases is the 
reversal of the roles of p and n. This is essentially due to the distribution 
of the singular values of S. Recall that for S = ATA, T ~ Wp{n,In). We 
can write the singular value decomposition of T as T = H'DH, but we can 
also write it as T = H[DiHi, where Hi is semi-orthogonal [HiH[ = I) and 
Di is the matrix of the positive eigenvalues of T. If T has full rank (i.e., 
n'>p), this coincides with the singular value decomposition of T. In the full 
rank case the joint density of H and D is given in (3.17), whereas in the 
rank-deficient case {p > n) joint density is given by (3.18), from which stems 
the reversal of the roles of p and n. 

In the heteroscedastic normal mean estimation problem, James and Stein 
(1961) used the loss function that was weighted by the inverse of the vari- 
ances and, consequently, the problem is essentially transformed to the ho- 
moscedastic case under ordinary squared error loss. Similarly, in this article, 
we used the invariant loss function in (1.1), therefore skirting a somewhat 
subtle issue. In the heteroscedastic setting where there are differing coor- 
dinate variances, minimax estimation and Bayes (or empirical Bayes) esti- 
mates can be qualitatively different. It turns out that minimax estimators in 
general shrink most on the coordinates with smaller variances, while Bayes 
estimators shrink most on large variance coordinates. Brown (1975) shows 
that the James-Stein shrinkage estimator does not dominate the X when the 
largest variance is larger than the sum of the rest. Moreover, Casella (1980) 
points out that the James-Stein shrinkage estimator may not be a desir- 
able shrinkage estimator under heteroscedasticity even when it is minimax. 
Morris and Lysy (2012) and Brown, Nie and Xie (2013) give an excellent 
perspective on minimaxity of the shrinkage estimator from Bayes and empir- 
ical Bayes points of view. Consequently, it would be of interest to examine 
the shrinkage patterns of the proposed estimates in the case of a nonin- 
variant loss function and assess how well the invariant loss works for p> n 
applications. 

One can imagine an extension of the results of this article beyond the 
normal distribution setting. Consider a model with the joint density for 
{X, S) the form 



(5.2) 




(5.3) 



f{ti^-'[{x-e){x-9y + s]), 
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where the p x 1 location vector 9 and the p x p scale matrix S are unknown. 
In the setting of p < n, Fourdrinier, Strawderman and Wells (2003) and 
Kubokawa and Srivastava (2001) give some results on improved location 
estimation for elliptically symmetric distributions. For more on elliptical 
symmetry and the various choices of /(•) in (5.3), see Fang, Kotz and Ng 
(1990); the class in (5.3) contains models such as the multivariate normal, 
t- and Kotz- type distributions. 

Finally, simulation study reveals that, when p is much larger than n, 
the estimate of S and S"^ are quite poor. This observation agrees with 
Kubokawa and Srivastava (2008), where Haff (1979)-type improved esti- 
mates of S are proposed. It would be of interest to use an improved estimator 
of S in 6r{X, S) in (2.1). As pointed out in the testing context by Srivastava 
and Fujikoshi (2006) and Srivastava (2007), a shortcoming of is that the 
associated estimator is only orthogonally invariant, while the sample mean 
vector is invariant. 

Acknowledgments. The authors are grateful to the Associate Editor and 
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